Genie Game Results (v1.2)

Exclusions

# table(full_pdf$obligatory_check)
pdf = full_pdf %>% 
    filter(obligatory_check == "correct") %>% 
    mutate(
        subj = row_number(),
        control = factor(control, levels=c("low", "high"))
    )

pdf2 = select(pdf, wid, subj, control)

unk_df = full_df %>% 
    right_join(pdf2) %>%  # also drops excluded participants
    filter(scenario != "PRACTICE") %>%
    group_by(wid) %>% 
    mutate(
        scenario = tolower(scenario),
        evaluation_z = zscore(evaluation),
        abs_eval = abs(evaluation),
        abs_eval_z = zscore(abs_eval),
        consideration=if_else(considered, "considered", "unconsidered")
    ) %>% 
    ungroup()

df = filter(unk_df, outcome != "UNK")
n_drop = nrow(full_pdf) - nrow(pdf)
n_drop_unk = nrow(unk_df) - nrow(df)
n_trial = nrow(df)

scenarios = full_scenarios %>%
    right_join(pdf2)  %>% 
    filter(scenario != "PRACTICE") %>%
    group_by(wid) %>% 
    mutate(
        scenario = tolower(scenario),
        scenario_evaluation_z = zscore(scenario_evaluation),
        trial_number = row_number()
    ) %>% 
    ungroup()

df = left_join(df, select(scenarios, wid, scenario, scenario_evaluation, scenario_evaluation_z))

We exclude 1 participant who answered the comprehension check incorrectly (incorrect = saying that you cannot take the outcome if you don’t like it), resulting in 54 participants.

We exclude 103 outcomes which were considered but not in the original set, resulting in 2677 observations. This was not my original plan, but I think it is the correct thing to do. See Out-of-set outcomes for an explanation.

Consideration probability by outcome value

How does the probability of an item being included in the considered set depend on its value (measured by post-decision rating)?

# BASELINE = mean(df$evaluation)

df %>% 
    ggplot(aes(evaluation, as.numeric(considered), color=control)) +
    stat_summary_bin(fun.data=mean_cl_boot, bins=5, alpha=0.5, 
                     position=position_dodge(width=.5)) +
    stat_smooth(geom="line", size=0.8, linetype = "dotted", alpha=0.5) +
    # geom_smooth(se=F, method=lm, formula=y ~ x + abs(x), alpha=0.1) +
    geom_smooth(se=F, method=glm, 
        formula=y ~ x + abs(x),
        method.args = list(family = "binomial"),
        alpha=0.1) +
    ylab("consideration probability") +
    control_colors

Reading the plot: points show binned means with 95% CI error bars. The dotted line shows a non-parametric GAM fit. The solid line shows a logistic regression of the form \(\text{logit}(y) = \beta_0 + \beta_1 x + \beta_2 |x|\). This is an adaptation of the earlier apples/oranges version of our model.

Mixed-effects regression with random intercepts and slopes for signed and absolute value:

consideration_model = df %>% 
    glmer(considered ~ control * (evaluation + abs(evaluation)) + (evaluation + abs(evaluation) | wid),
     family=binomial, data=.)

plot_coefs(consideration_model, omit.coefs=c("controlhigh", "(Intercept)"), colors="black") #plot

summ(consideration_model)
Fixed Effects
                              Est.   S.E.   z val.      p
(Intercept)                 -3.730  0.317  -11.763  0.000
controlhigh                  0.187  0.394    0.475  0.635
evaluation                  -0.050  0.027   -1.867  0.062
abs(evaluation)              0.204  0.045    4.534  0.000
controlhigh:evaluation       0.139  0.040    3.462  0.001
controlhigh:abs(evaluation) -0.093  0.062   -1.500  0.134

The key predictions are confirmed:

  • Both groups are more likely to consider extreme outcomes.
  • Only the high-control group is more likely to consider positive outcomes.

We also see an unexpected trend (not quite significant) that the low-control group preferentially considers negative outcomes. I can think of two possible explanations for this:

  1. People aren’t using the scale in a symmetric way; specifically, negative ratings are actually more extreme than the opposite positive ratings.
  2. People are biased towards sampling bad things.

Explanation (2) is much more interesting, as it violates the prediction of the rational model (at least as it is currently posed). However, (1) is also very plausible, so we would want to rule it out before interpreting this finding too much.

How can we ensure that people are using the scale symmetrically? One strategy is to add a phase at the end of the experiment where participants evaluate pairs of outcomes. We pick out specific outcomes that they evaluated as (roughly) opposite around 0 and ask if they would be willing to have the bad thing happen if the good thing would also happen. If (1) is correct, then people will respond “no” more than half the time. This feels not all that convincing, and kind of complicated, to me though.
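As a rough sketch of how such a pairing phase might be analyzed (all data here are simulated; nothing like this was collected), we could test whether the acceptance rate departs from 50%:

```r
# Sketch of the proposed symmetry check, on simulated data.
# Each trial: would you accept the bad outcome in order to also get the
# good one, given the two were rated as (roughly) opposite? Under
# symmetric scale use, acceptance should be ~50%; under explanation (1),
# below 50%.
set.seed(1)
accept = rbinom(40, 1, prob = 0.35)  # hypothetical responses from 40 pairs
binom.test(sum(accept), length(accept), p = 0.5)
```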

👉 Do we want to add anything to the experiment to check whether participants are using the scale non-symmetrically?

(I’m tagging action/discussion items with the 👉 emoji)

By scenario

We have eight scenarios, two for each category.

  • sports (season): Pick a professional sport and you’ll get a free pair of front-row tickets every week for one season. You have to go every week (and you can’t sell them!)
  • sports (silence): Pick a professional sport and you’ll never see or hear about it again in your life.
  • animals (intimate): Pick a zoo animal and you’ll spend 20 minutes in a cage with it.
  • animals (bubble): Pick a zoo animal and you’ll get to watch it in its natural habitat from a magical floating bubble for a few hours.
  • subjects (school): Pick an academic subject and you’ll have to pass the entry level course in that subject at a community college.
  • subjects (magic): Pick an academic subject and you’ll instantly gain the knowledge of a typical PhD in that subject.
  • vehicles (commute): Pick a mode of transportation and you’ll have to take it to work for the next year (it’ll be free!)
  • vehicles (veto): Pick a mode of transportation and you won’t be allowed to use it for the next year.

Here is the “check plot” for each:

wrap_scenario = list(
    facet_wrap(~scenario, dir="v", ncol=4),
    theme(strip.text.x = element_text(size=12), legend.position="top")
)

df %>% 
    ggplot(aes(evaluation, as.numeric(considered), color=control)) +
    stat_summary_bin(fun.data=mean_cl_boot, bins=5, alpha=0.5, 
                     position=position_dodge(width=.5)) +
    geom_smooth(se=F, method=glm, 
        formula=y ~ x + abs(x),
        method.args = list(family = "binomial"),
        alpha=0.1) +
    ylab("consideration probability") +
    control_colors + wrap_scenario

Clearly some of the scenarios are doing a lot more work than others. We consider this breakdown further below.

Value distributions

To better understand the effect of scenario, we need to look at the underlying distributions of outcome values. Here we are plotting the distribution of all outcome values in gray and the distribution of considered outcomes in blue and yellow.

df %>% 
    ggplot(aes(evaluation, y=..prop..)) + 
    geom_bar(alpha=1, fill="gray50", position="identity") +
    geom_bar(aes(fill=control), data=filter(df, considered), alpha=0.7, position="identity") +
    control_colors + wrap_scenario + 
    labs(fill="considered by", y="proportion")

It’s hard to know what to make of this. Ideally, there would be some systematic relationship between the distribution of outcome values (gray) and consideration. But I don’t really see that.

For example, consider sports (silence) and vehicles (veto). These have very similar value distributions, but totally different consideration profiles. However, my intuition says these situations shouldn’t have similar value distributions: silencing a sport could actually be positive, but being banned from using some mode of transportation is pretty much strictly negative. This makes me think that people are using the scale in a relative way, such that +5 means “given that you are going to have to give up some mode of transport, boats would be a pretty good option.”

The problem is that the model won’t be able to do this kind of reasoning. It’s possible that the total scenario evaluation could provide some information here, but I’m not sure how exactly that would work. I see two reasonable paths here:

  1. Give up on the idea of making scenario-specific predictions with this paradigm. Use this paradigm to show the high-level prediction (the first plot) and then move to a paradigm with experimentally manipulated values to test the model at a finer level of granularity.
  2. Use a different experimental strategy to get absolute evaluations. For example, rather than grouping evaluations by scenario (which encourages scenario-dependent valuation), we could do all the evaluations at the end of the experiment, after they’ve rated each scenario.

👉 Do we want to make scenario-specific predictions? If so, how can we get ratings that are consistent across scenarios?

Tricky nuances

Out-of-set outcomes

For each scenario, we define a set of outcomes that each participant will rate. What do we do when a participant reports considering an outcome outside of that set? We collect a rating for those outcomes as well, and I was previously just including that as another observation. However, I don’t think this is correct: the value of the DV shouldn’t determine whether or not an item is included in the dataset. For one thing, this will create systematic differences in the value distribution across conditions, because the high-control group will consider more out-of-set high-value options.

👉 In the next pilot, I think we should not even collect evaluations for out-of-set outcomes. This ensures that everyone rates the exact same set of outcomes.

If we aren’t using out-of-set outcomes, then we would ideally have scenarios with small sets of outcomes that anyone might consider. Unfortunately, none of our categories excels here. This is the number and proportion of considered responses that were not in-set for each scenario:

unk_df %>% 
    filter(considered) %>% 
    group_by(scenario) %>% 
    mutate(out=outcome=="UNK") %>% 
    summarise(n=sum(out), prop=mean(out)) %>% kable(digits=2)
scenario                n  prop
animals (bubble)       10  0.24
animals (intimate)     12  0.21
sports (season)         6  0.23
sports (silence)        7  0.18
subjects (magic)       24  0.75
subjects (school)      27  0.68
vehicles (commute)     15  0.33
vehicles (veto)         2  0.04

I’m not sure how much of a problem this is. It does mean that we have to throw out a lot of potentially useful consideration data. But I think there is nothing wrong in principle with asking: “out of these outcomes, what is the probability that each is considered as a function of value and condition?”

👉 Is it okay to have many out-of-set considered options? Or do we need to adjust our stimuli?

Value distribution by condition

Outcome evaluations should be identically distributed across conditions because both conditions rate the same set of outcomes, and the quality of an outcome itself doesn’t depend on how it might have been selected (right?).

df %>% 
    ggplot(aes(evaluation, y=..prop.., fill=control)) + 
    geom_bar(alpha=0.7, position="identity") +
    control_colors + wrap_scenario +
    labs(fill="evaluated by", y="proportion")

df %>% 
    mutate(scenario = fct_reorder(scenario, evaluation)) %>% 
    ggplot(aes(scenario, y=evaluation, color=control)) + 
    # geom_quasirandom(dodge.width=.8, size=.1) +
    # geom_boxplot(position=position_dodge(width=.5)) +
    stat_summary(fun.data=mean_cl_boot, position=position_dodge(width=.5)) +
    control_colors + coord_flip()

This does not appear to hold. It looks like low control tends to give more extreme values. This does provide additional evidence that the slider ratings aren’t really given on an absolute utility scale. I’m not sure how much of a problem this is.
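One way to quantify the extremity difference (a sketch, not part of the analyses above; it reuses the abs_eval column computed during preprocessing) would be a mixed model on absolute ratings:

```r
# Sketch: is low control associated with more extreme (absolute) ratings?
# Assumes df (with abs_eval, control, wid, scenario) as defined above,
# and lme4/jtools loaded as elsewhere in this report.
df %>%
    lmer(abs_eval ~ control + (1 | wid) + (1 | scenario), data = .) %>%
    summ
```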

👉 Is it okay that the value distributions vary systematically by condition?

Discussion

I’ve highlighted specific action items above. Here they are again:

  • Do we want to add anything to the experiment to check whether participants are using the scale non-symmetrically?
  • Do we want to make scenario-specific predictions? If so, how can we get ratings that are consistent across scenarios?
  • In the next pilot, I think we should not even collect evaluations for out-of-set outcomes. This ensures that everyone rates the exact same set of outcomes.
  • Is it okay to have many out-of-set considered options? Or do we need to adjust our stimuli?
  • Is it okay that the value distributions vary systematically by condition?

A theme across several of these is that participants are likely not giving outcome ratings on an absolute scale. If we are going to do any interesting modeling, we will need to either resolve this issue or move to a paradigm in which we can objectively quantify (and ideally manipulate) outcome value. I think for CogSci, we are probably fine with the basic qualitative effect, but if there are any easy things we can try, it’s probably worth a shot.

Quality/sanity checks

Slider scaling

To give people reference points for the slider, we ask them to evaluate seven different events. People seem to give fairly reliable responses, except for a couple of jokesters who rate eating a spoonful of rice as maximally good, which I’ll admit is pretty funny.

slider = full_slider %>% right_join(pdf2) %>% rename(evaluation="response")

slider %>% 
    mutate(prompt = fct_reorder(prompt, evaluation)) %>% 
    ggplot(aes(prompt, evaluation)) +
    # geom_quasirandom(size=.8) +
    geom_line(aes(group=wid), alpha=0.5, size=0.5) +
    stat_summary(fun.data=mean_cl_boot, color="red") +
    coord_flip() + xlab("")

Here is mean and standard deviation for each participant.

slider %>% 
    mutate(wid = fct_reorder(wid, evaluation, .fun=sd)) %>% 
    ggplot(aes(wid, evaluation)) +
    stat_summary(fun.data=mean_sdl) +
    scale_x_discrete(breaks=NULL) +
    xlab("participant")
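To flag inconsistent responders more directly (a sketch; any exclusion threshold here would be arbitrary), we could score each participant’s agreement with the group ordering of the prompts:

```r
# Sketch: per-participant Spearman correlation with the group-mean
# ordering of the slider prompts. Low values would flag "jokesters".
# Assumes slider (with wid, prompt, evaluation) as defined above.
slider %>%
    group_by(prompt) %>%
    mutate(group_mean = mean(evaluation)) %>%
    group_by(wid) %>%
    summarise(agreement = cor(evaluation, group_mean, method = "spearman")) %>%
    arrange(agreement)
```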

Higher scenario evaluations in high-control

Perhaps the most obvious and basic prediction one could make is that the scenarios should be rated more highly in the high control condition.

scenarios %>% 
    ggplot(aes(scenario, scenario_evaluation, color=control)) +
    stat_summary(fun.data=mean_cl_boot, position=position_dodge(width=.35)) +
    # geom_quasirandom(size=.8, dodge.width=.8, alpha=0.5) + 
    control_colors + coord_flip()

Mixed effects regression with random intercepts for participant and scenario:

scenarios %>% lmer(scenario_evaluation ~ control + (1|wid) + (1|scenario), data=.) %>% summ
Fixed Effects
               Est.   S.E.  t val.    d.f.      p
(Intercept)  -0.191  1.463  -0.131   8.398  0.899
controlhigh   1.814  0.764   2.374  51.128  0.021
p values calculated using Satterthwaite d.f.

Notably, we see that low-control participants give higher evaluations for the scenarios where there is really no bad outcome. One possible explanation for this is that people use previous trials as reference points. This predicts that the low-control group will start giving higher evaluations later in the experiment. We sort of see this (note that trial 0 is the practice trial, which is always a vacation in some European city).

full_scenarios %>% 
    group_by(wid) %>% 
    mutate(trial_number = row_number()-1) %>%
    ungroup() %>% 
    right_join(pdf2) %>% 
    ggplot(aes(trial_number, scenario_evaluation, color=control)) +
    stat_summary(fun.data=mean_cl_boot, position=position_dodge(width=.35)) +
    # geom_quasirandom(size=.8, dodge.width=.8, alpha=0.5) + 
    control_colors
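The reference-point account could also be tested directly (a sketch, not run here) as a control × trial interaction, using the trial_number column added to scenarios above:

```r
# Sketch: does the control gap in scenario evaluations grow over trials?
# A significant control:trial_number interaction (low control increasing
# relative to high) would support the reference-point account.
scenarios %>%
    lmer(scenario_evaluation ~ control * trial_number +
             (1 | wid) + (1 | scenario), data = .) %>%
    summ
```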

Outcome evaluations predict scenario evaluations

Evaluations of the full scenario should depend on evaluations of individual outcomes of that scenario. The simplest thing is to just look at the correlation between the two types of evaluation.

library(ggpubr)
df %>% 
    ggplot(aes(evaluation, scenario_evaluation)) +
    geom_point(size=.3, alpha=.5, position="jitter") + 
    # geom_bin_2d(breaks=seq(-10.5, 10.5)) + 
    geom_smooth(method="lm", se=F) +
    stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.01, label.y=11.5) +
    ylim(-10, 12) +
    facet_wrap(~consideration)

It also looks like it holds in individual scenarios.

df %>% 
    ggplot(aes(evaluation, scenario_evaluation, color=consideration)) +
    geom_point(size=.3, alpha=.5, position="jitter") + 
    wrap_scenario + considered_colors +
    geom_smooth(method="lm", se=F)

For a statistical test, I think looking for an interaction in a regression makes sense. However, this is using each scenario evaluation (which is, confusingly, the outcome variable) multiple times (once for each outcome evaluation). Not sure if that’s kosher. The regression has random intercepts for scenario and participant and random participant slopes for each fixed effect.

lmer(scenario_evaluation ~ evaluation * considered + (evaluation*considered|wid) + (1|scenario), data=df) %>% summ
Fixed Effects
                             Est.   S.E.  t val.     d.f.      p
(Intercept)                 0.728  1.269   0.574    8.525  0.581
evaluation                  0.185  0.036   5.144   53.842  0.000
consideredTRUE             -0.198  0.281  -0.706  131.422  0.481
evaluation:consideredTRUE   0.118  0.041   2.852   53.507  0.006
p values calculated using Satterthwaite d.f.
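One way to avoid reusing each scenario evaluation multiple times (a sketch of an alternative, not the analysis reported above) is to aggregate outcome evaluations first, so each scenario contributes at most one row per consideration status:

```r
# Sketch: collapse to one row per wid × scenario × consideration status,
# then regress scenario evaluation on the mean outcome evaluation.
# Assumes df as defined above and lme4/jtools loaded as elsewhere.
agg = df %>%
    group_by(wid, scenario, considered, scenario_evaluation) %>%
    summarise(mean_eval = mean(evaluation), .groups = "drop")

lmer(scenario_evaluation ~ mean_eval * considered +
         (1 | wid) + (1 | scenario), data = agg) %>% summ
```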

Number of options considered

This is pretty self-explanatory. Overall, it seems like we are getting pretty decent rates.

scenarios %>% 
    ggplot(aes(n_considered, fill=control, y=..prop..)) + 
    geom_bar(position="identity", alpha=0.6) + control_colors
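A quick numeric summary to go with the histogram (a sketch):

```r
# Sketch: central tendency of consideration-set size by condition.
# Assumes scenarios (with control, n_considered) as defined above.
scenarios %>%
    group_by(control) %>%
    summarise(mean_considered = mean(n_considered),
              median_considered = median(n_considered))
```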